MEDB 5501, Module11

2025-11-04

Topics to be covered

  • What you will learn
    • Analysis of variance is linear regression
    • R code for analysis of variance
    • Log transformation
    • R code for analysis of variance
    • Kruskal-Wallis test
    • R code for analysis of variance
    • Your homework

Indicator variables, 1

  • Two levels is less complex with three or more levels
    • R assigns 0 to first category (alphabetically)
    • R assigns 1 to second category
    • Examples:
      • Female=0, Male=1
      • Man=0, Woman=1
  • Interpretation
    • Intercept is estimated average outcome for first category
    • Slope is the estimated average change
      • Second category average minus first category average

Example: Turtle experiment, 1

Sex     Fed  Fasted10  Fasted20
Male   42.8      42.4      38.9
Male   43.1      42.2      40.3
Male   40.4      40.8      37.5
Male   46.6      45.9      42.9
Female 42.2      42.4      39.7
Female 38.7      38.1      35.8
Female 35.3      34.3      32.3
Female 40.5      40.1      37.3

Turtle experiment, 2

Sex     Fed  Fasted10  Fasted20 Indicator
Male   42.8      42.4      38.9         1
Male   43.1      42.2      40.3         1
Male   40.4      40.8      37.5         1
Male   46.6      45.9      42.9         1
Female 42.2      42.4      39.7         0
Female 38.7      38.1      35.8         0
Female 35.3      34.3      32.3         0
Female 40.5      40.1      37.3         0

Indicator variables, 2

  • With k levels, you need k-1 indicators
    • First indicator
      • R assigns 1 to second category (alphabetically)
      • R assigns 0 to all other categories
    • Second indicator
      • R assigns 1 to third category (alphabetically)
      • R assigns 0 to all other categories
    • And so on
  • Note: the first category in alphabetical order is zero for all indicator variables.

Example with dietary cracker data

Cracker Digested
control 1772.84
bran    1752.63
combo   2121.97
gum     2558.61
gum     2026.91
bran    2047.42
combo   2254.75
control 2353.21
combo   2153.36
gum     2331.19
bran    2547.77
...

Dietary cracker data, first indicator variable

Cracker Digested i1
control 1772.84   0
bran    1752.63   0
combo   2121.97   1
gum     2558.61   0
gum     2026.91   0
bran    2047.42   0
combo   2254.75   1
control 2353.21   0
combo   2153.36   1
gum     2331.19   0
bran    2547.77   0
...

Dietary cracker data, second indicator variable

Cracker Digested i1 i2
control 1772.84   0  1
bran    1752.63   0  0
combo   2121.97   1  0
gum     2558.61   0  0
gum     2026.91   0  0
bran    2047.42   0  0
combo   2254.75   1  0
control 2353.21   0  1
combo   2153.36   1  0
gum     2331.19   0  0
bran    2547.77   0  0
...

Dietary cracker data, third indicator variable

Cracker Digested i1 i2 i3
control 1772.84   0  1  0
bran    1752.63   0  0  0
combo   2121.97   1  0  0
gum     2558.61   0  0  1
gum     2026.91   0  0  1
bran    2047.42   0  0  0
combo   2254.75   1  0  0
control 2353.21   0  1  0
combo   2153.36   1  0  0
gum     2331.19   0  0  1
bran    2547.77   0  0  0
...

Interpretation with multiple indicator variables

  • Intercept is estimated average outcome for first category
  • First slope is the estimated change
    • Second category average minus first category average
  • Second slope is the estimated change
    • Third category average minus first category average
  • And so on

What if you want a different reference category?

Create your own indicator variables

dietary |>
  mutate(i1=as.numeric(Cracker=="bran")) |>
  mutate(i2=as.numeric(Cracker=="combo")) |>
  mutate(i3=as.numeric(Cracker=="gum")) -> dietary_1

or revise the order using the factor function.

new_order <- c("control", "bran", "combo", "gum")
dietary |>
  mutate(Cracker=factor(Cracker, levels=new_order)) -> dietary_2
lm()

Example using fruitfly lifespans, 1

  • Experiment with 125 cages
    • Does fruitfly mating affect average male lifespan?
    • Isolate a male fruitfly with
      • one or eight virgin females,
      • one or eight pregnant females, or
      • no females
    • Note: males will not mate with pregnant females

Example using fruitfly lifespans, 2

  • Cages 1-25 have one male fruitfly, no female fruit flies
  • Cages 26-50 have one male fruitfly, one pregnant female
  • Cages 51-75 have one male fruitfly, one virgin female
  • Cages 76-100 have one male fruitfly, eight pregnant females
  • Cages 101-125 have one male fruitfly, eight virgin females

Fruitfly lifespan boxplots

Fruitfly lifespan group means

# A tibble: 5 × 4
  cage                   longevity_mn longevity_sd     n
  <chr>                         <dbl>        <dbl> <int>
1 Eight pregnant females         63.4         14.5    25
2 Eight virgin females           38.7         12.1    25
3 No females                     63.6         16.5    25
4 One pregnant female            64.8         15.7    25
5 One virgin female              56.8         14.9    25

Break #1

  • What you have learned
    • Analysis of variance is linear regression
  • What’s coming next
    • R code for analysis of variance

R code for analysis of variance

Refer to the program simon-5501-11-fruitfly.qmd.

Break #2

  • What you have learned
    • R code for analysis of variance
  • What’s coming next
    • Log transformation

Analysis of variance model

  • Sample 1: \(Y_{11}, Y_{12}, \cdots, Y_{1n_1}\) are \(N(\mu_1, \sigma)\)
  • Sample 2: \(Y_{21}, Y_{22}, \cdots, Y_{2n_2}\) are \(N(\mu_2, \sigma)\)
  • \(\cdots\)
  • Sample k: \(Y_{k1}, Y_{k2}, \cdots, Y_{kn_k}\) are \(N(\mu_k, \sigma)\) \(\ \)
  • \(H_0:\ \mu_1=\mu_2=...=\mu_k\)
  • \(H_1:\ \mu_i \ne \mu_j\) for some i, j

Violation of assumptions

  • Non-normality
  • Heterogeneity
  • Lack of independence

When to consider a log transformation

  • Only positive values
  • Max/min > 3
  • Right (positive) skewed distribution
  • Groups with larger means have more variation

How logarithms convert multiplication and division

  • \(log(a \times b)=log(a)+log(b)\)
    • \(log(a / b)=log(a)-log(b)\)

The slide rule uses this property

The book of logarithms

What a back transformation does

  • \(antilog(a)+antilog(b)=antilog(a \times b)\)
  • \(antilog(a)-antilog(b)=antilog(a / b)\)
    • For log base ten, antilog(x) means \(10^x\)
    • For log base two, antilog(x) means \(2^x\)
  • A difference of means on the log scale becomes a ratio of means after back transformation.

Interpretation of the back transformed confidence intervals

  • A confidence interval is a range of plausible values.
    • A back transformed confidence interval is a range of plausible ratios.
  • Does the interval contain the ratio of 1?
    • The two groups are statistically the same.
  • Does the interval contain only values larger than 1?
    • The first group is statistically larger
  • Does the interval contain only values smaller than 1?
    • The first group is statistically smaller

Log transformation, 1

# A tibble: 10 × 7
   term  contrast             null.value estimate conf.low conf.high adj.p.value
   <chr> <chr>                     <dbl>    <dbl>    <dbl>     <dbl>       <dbl>
 1 cage  One pregnant female…          0  1.02e-2  -0.0871    0.107      9.98e-1
 2 cage  Eight pregnant fema…          0  1.90e-4  -0.0971    0.0975     1.00e+0
 3 cage  One virgin female-N…          0 -5.19e-2  -0.149     0.0454     5.79e-1
 4 cage  Eight virgin female…          0 -2.25e-1  -0.322    -0.127      3.18e-8
 5 cage  Eight pregnant fema…          0 -9.99e-3  -0.107     0.0873     9.99e-1
 6 cage  One virgin female-O…          0 -6.21e-2  -0.159     0.0352     3.97e-1
 7 cage  Eight virgin female…          0 -2.35e-1  -0.332    -0.138      7.64e-9
 8 cage  One virgin female-E…          0 -5.21e-2  -0.149     0.0452     5.75e-1
 9 cage  Eight virgin female…          0 -2.25e-1  -0.322    -0.128      3.10e-8
10 cage  Eight virgin female…          0 -1.73e-1  -0.270    -0.0755     2.74e-5

Log transformation, 3

# A tibble: 10 × 4
   contrast                                    estimate conf.low conf.high
   <chr>                                          <dbl>    <dbl>     <dbl>
 1 One pregnant female-No females                 1.02     0.818     1.28 
 2 Eight pregnant females-No females              1.00     0.800     1.25 
 3 One virgin female-No females                   0.887    0.709     1.11 
 4 Eight virgin females-No females                0.596    0.477     0.746
 5 Eight pregnant females-One pregnant female     0.977    0.781     1.22 
 6 One virgin female-One pregnant female          0.867    0.693     1.08 
 7 Eight virgin females-One pregnant female       0.582    0.466     0.729
 8 One virgin female-Eight pregnant females       0.887    0.709     1.11 
 9 Eight virgin females-Eight pregnant females    0.596    0.476     0.746
10 Eight virgin females-One virgin female         0.672    0.537     0.841

Break #3

  • What you have learned
    • Log transformation
  • What’s coming next
    • R code for analysis of variance

R code for log transformed analysis of variance

Refer to the program simon-5501-11-fruitfly.qmd.

Break #4

  • What you have learned
    • R code for analysis of variance
  • What’s coming next
    • Kruskal-Wallis test

The Chi-squared distribution

  • Often denoted as \(\chi^2_{df}\)
  • Has a single parameter, degrees of freedom
    • Never negative
    • Skewed right
    • Mean equals degrees of freedom
  • Calculations in R
    • pchisq(x, df) = \(P[\chi^2_{df} < x]\)
    • qchisq(p, df) = \(p^{th}\) quantile, \(\chi^2_{df, p}\)

The Chi-squared distribution

Computing the Kruskal-Wallis test

  • Rank the observations, \(R(X_{ij})\)
    • 1 for smallest, 2 for second smallest, etc.
    • Compute the average rank in each group, \(\bar{R}_i\)
    • Compute the overall rank, \(\bar{R}\)
    • T = \((N-1)\frac{\Sigma n_i(\bar{R}_i - \bar{R})^2}{\Sigma \Sigma(R(X_{ij})-\bar{R})^2}\)

Decision rule for Kruskal-Wallis test, 1

  • Accept \(H_0\) if T <- \(\chi^2_{df, 1-\alpha}\)
    • df = k-1
  • Accept \(H_0\) if p-value > \(\alpha\)
    • p-value = \(P[\chi^2_{df} > T]\)

Decision rule for Kruskal-Wallis test, 1

  • Null hypothesis is difficult to define
    • Does not involve population means
    • Some claim involvement of population medians
    • Stochastic dominance
      • \(P[X_{aj} > X_{bj}] > 0.5\) for some a and b.

Application of Kruskal-Wallis test to fruitfly longevity

kruskal.test(longevity ~ cage, data=fly_1)

    Kruskal-Wallis rank sum test

data:  longevity by cage
Kruskal-Wallis chi-squared = 37.961, df = 4, p-value = 1.142e-07

Break #5

  • What you have learned
    • Kruskal-Wallis test
  • What’s coming next
    • R code for analysis of variance

R code for the Kruskal-Wallis test

Refer to the program simon-5501-11-fruitfly.qmd.

Break #6

  • What you have learned
    • R code for analysis of variance
  • What’s coming next
    • Your homework

Your homework

Refer to the file simon-5501-11-directions on my github site.

Summary

  • What you have learned
    • Analysis of variance is linear regression
    • R code for analysis of variance
    • Log transformation
    • R code for analysis of variance
    • Kruskal-Wallis test
    • R code for analysis of variance
    • Your homework